As usual, please render the HTML file and submit it on Canvas. This assignment uses a dataset containing obesity data from individuals in several countries. The file is already uploaded to RStudio Cloud as “Obesity_data.csv”. Read it into R as a data frame named “obesity”.
library(readr)  # provides read_csv()
obesity = read_csv("Obesity_data.csv")
Rows: 2111 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Gender, family_history_with_overweight, SMOKE
dbl (3): Age, Height, Weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Gender Age Height Weight
Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
Mode :character Median :22.78 Median :1.700 Median : 83.00
Mean :24.31 Mean :1.702 Mean : 86.59
3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
Max. :61.00 Max. :1.980 Max. :173.00
family_history_with_overweight SMOKE
Length:2111 Length:2111
Class :character Class :character
Mode :character Mode :character
Several of the variables are binary or categorical. Because data in a clustering problem MUST be numeric, we will retain only the numeric variables for clustering, so please consider dropping the categorical ones. Please make sure to retain the original data frame: once we form the clusters, you will append them back to it.
Q1 Is there any missingness in the dataset? Recall that we must address any missingness prior to clustering. If you discover any missingness, use row-wise deletion to eliminate it.
summary(obesity)
Gender Age Height Weight
Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
Mode :character Median :22.78 Median :1.700 Median : 83.00
Mean :24.31 Mean :1.702 Mean : 86.59
3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
Max. :61.00 Max. :1.980 Max. :173.00
family_history_with_overweight SMOKE
Length:2111 Length:2111
Class :character Class :character
Mode :character Mode :character
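There is no missing data: summary() reports no NA's for the numeric variables (Age, Height, Weight), so no rows need to be deleted.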
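Because summary() does not print NA counts for character columns, a more direct check can be useful. The following is a minimal sketch, assuming a small made-up stand-in named `obesity_demo` (its values are illustrative only and are not taken from Obesity_data.csv). `colSums(is.na())` counts missing values per column, and `na.omit()` performs the row-wise deletion.

```r
# Hypothetical stand-in for the obesity data frame; values are made up
# purely so the example runs on its own.
obesity_demo <- data.frame(
  Gender = c("Female", "Male", NA, "Male"),
  Age    = c(21, 23, 27, NA),
  Height = c(1.62, 1.80, 1.75, 1.70),
  Weight = c(64, 77, 87, 90),
  stringsAsFactors = FALSE
)

# Count missing values in every column, character columns included.
na_counts <- colSums(is.na(obesity_demo))
print(na_counts)

# Row-wise deletion: drop every row that has at least one NA.
obesity_complete <- na.omit(obesity_demo)
print(nrow(obesity_complete))
```

On the real data, the same two calls would be applied to `obesity`. If every count is zero, `na.omit()` leaves the data frame unchanged.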
Q2 Create a new data frame to hold scaled values of all of the variables in the obesity data frame. Note: You do not need to exclude any variables from the scaling.
Age Height Weight
Min. :-1.6251 Min. :-2.69737 Min. :-1.8169
1st Qu.:-0.6879 1st Qu.:-0.76821 1st Qu.:-0.8061
Median :-0.2418 Median :-0.01263 Median :-0.1369
Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.2659 3rd Qu.: 0.71579 3rd Qu.: 0.7959
Max. : 5.7812 Max. : 2.98294 Max. : 3.2994
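The code that produced the scaled summary is not shown, so here is one possible way to get a similar result. It is a sketch only: `obesity_demo` and `obesity_s_demo` are hypothetical names with made-up values, and the actual assignment may have built its scaled data frame differently.

```r
# Hypothetical stand-in for the obesity data frame; values are made up.
obesity_demo <- data.frame(
  Gender = c("Female", "Male", "Male", "Female", "Male"),
  Age    = c(21, 23, 27, 35, 19),
  Height = c(1.62, 1.80, 1.75, 1.58, 1.85),
  Weight = c(64, 77, 87, 95, 70),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns; the original data frame is left untouched.
numeric_cols <- sapply(obesity_demo, is.numeric)
obesity_num  <- obesity_demo[, numeric_cols]

# scale() centers each column to mean 0 and rescales it to standard
# deviation 1; it returns a matrix, so convert back to a data frame.
obesity_s_demo <- as.data.frame(scale(obesity_num))

summary(obesity_s_demo)
```

After scaling, every column has mean 0, which matches the `Mean : 0.0000` entries in the summary above.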
Q3 Use the NbClust function to determine the “optimal” number of clusters for this dataset.
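No NbClust call is shown for this question, so the following is a hedged sketch of what one might look like. It assumes the NbClust package is installed and runs on simulated stand-in data (`scaled_demo`, a hypothetical name) rather than the real scaled obesity data. To keep the run short it requests only the silhouette index. The function's default, `index = "all"`, computes the full set of indices and reports a majority-rule recommendation.

```r
library(NbClust)  # assumes the NbClust package is installed

# Hypothetical stand-in for the scaled data: three simulated groups in
# three columns, generated only so the example runs on its own.
set.seed(123)
centers <- rbind(c(-2, -2, -2), c(0, 0, 0), c(2, 2, 2))
sim <- do.call(rbind, lapply(1:3, function(i) {
  matrix(rnorm(50 * 3, sd = 0.5), ncol = 3) +
    matrix(centers[i, ], nrow = 50, ncol = 3, byrow = TRUE)
}))
colnames(sim) <- c("Age", "Height", "Weight")
scaled_demo <- as.data.frame(scale(sim))

# Evaluate k = 2..10 with k-means, using the silhouette index only.
nb <- NbClust(data = scaled_demo, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans",
              index = "silhouette")

# Best.nc holds the suggested number of clusters and the index value.
print(nb$Best.nc)
```

On the real data, `scaled_demo` would be replaced by the scaled data frame from Question 2.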
Q4 Using the number of clusters that you identified in Question 3, create the clusters. Please set the random number seed to ‘123’.
set.seed(123)
fit = kmeans(obesity_s, 3)
Q5 Attach the clustering you created in Question 4 back to the ORIGINAL (not scaled) data frame.
summary(fit$cluster)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.064 3.000 3.000
obesity$cluster = fit$cluster
summary(obesity)
Gender Age Height Weight
Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
Mode :character Median :22.78 Median :1.700 Median : 83.00
Mean :24.31 Mean :1.702 Mean : 86.59
3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
Max. :61.00 Max. :1.980 Max. :173.00
family_history_with_overweight SMOKE cluster
Length:2111 Length:2111 Min. :1.000
Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Median :2.000
Mean :2.064
3rd Qu.:3.000
Max. :3.000
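Once the cluster labels sit in the original data frame, per-cluster averages in the original units can make the clusters easier to interpret. This is a minimal sketch on a made-up stand-in (`obesity_demo`, with hypothetical values and cluster labels), not output from the real data.

```r
# Hypothetical stand-in for the obesity data frame with a cluster column;
# the values and cluster labels are made up.
obesity_demo <- data.frame(
  Age     = c(20, 22, 21, 40, 45, 38),
  Height  = c(1.60, 1.65, 1.62, 1.80, 1.78, 1.82),
  Weight  = c(55, 60, 58, 95, 100, 90),
  cluster = c(1, 1, 1, 2, 2, 2)
)

# Mean of each numeric variable within each cluster, in original units.
cluster_means <- aggregate(cbind(Age, Height, Weight) ~ cluster,
                           data = obesity_demo, FUN = mean)
print(cluster_means)

# Number of observations per cluster.
print(table(obesity_demo$cluster))
```

`table()` is also a more natural summary of cluster labels than the quartiles that `summary()` reports for a numeric column.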
Q6 Using the clustering you attached in Question 5, create the following plots (fill color by cluster): a) height versus weight b) age versus height c) age versus weight
plot1 = ggplot(obesity, aes(x = Height, y = Weight, color = factor(cluster))) + geom_point()
ggplotly(plot1)
plot2 = ggplot(obesity, aes(x = Age, y = Height, color = factor(cluster))) + geom_point()
ggplotly(plot2)
plot3 = ggplot(obesity, aes(x = Age, y = Weight, color = factor(cluster))) + geom_point()
ggplotly(plot3)
Q7 Do there appear to be patterns in the data that might suggest obesity?
It seems that the younger you are, the more likely you are to be obese, and likewise that the taller you are, the more likely you are to be obese. Overall, the clusters suggest that obesity is more likely among younger and taller individuals.
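One way to check a reading like this is to compute body mass index per cluster. The sketch below assumes Height is in meters and Weight is in kilograms (consistent with the ranges in the summaries above) and uses the conventional BMI >= 30 cutoff for obesity. The data frame `obesity_demo`, its values, and its cluster labels are made up for illustration.

```r
# Hypothetical stand-in for the obesity data frame with a cluster column;
# the values and cluster labels are made up.
obesity_demo <- data.frame(
  Age     = c(20, 22, 21, 40, 45, 38),
  Height  = c(1.60, 1.65, 1.62, 1.80, 1.78, 1.82),
  Weight  = c(55, 60, 90, 95, 100, 110),
  cluster = c(1, 1, 1, 2, 2, 2)
)

# BMI = weight (kg) / height (m)^2; units are an assumption here.
obesity_demo$BMI   <- obesity_demo$Weight / obesity_demo$Height^2
obesity_demo$obese <- obesity_demo$BMI >= 30

# Share of obese individuals in each cluster.
obese_share <- tapply(obesity_demo$obese, obesity_demo$cluster, mean)
print(obese_share)
```

Applied to the real data, comparing these shares with the per-cluster Age and Height would show whether the clusters support the pattern described above.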